MPI Runtime Error Detection with MUST: Advanced Error Reports
Abstract
The Message Passing Interface (MPI) is a widely used paradigm for distributed memory programming. Its API is designed primarily for performance rather than usability; it provides only very limited abstractions that help enforce its correct use. As a result, application developers need tools that aid in the detection and removal of MPI usage errors. Our runtime error detection tool MUST addresses this issue and provides a wide range of automatic correctness checks. MUST uses state-of-the-art approaches to cope with complex MPI semantics such as derived datatypes, collective operations, and wildcard receive operations. Equally important to detecting correctness violations, however, is that such tools present all details of the violating MPI call(s) needed to pinpoint the problem in the source code and remove the error. In this paper, we focus on the error reports presented by MUST and propose a new set of error reports that present complex errors with fine-grained details of the error situation. These include a deadlock view and a view for usage errors in complex MPI datatypes.
Related resources
MUST: A Scalable Approach to Runtime Error Detection in MPI Programs
The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or non-portable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected ...
Runtime MPI Correctness Checking with a Scalable Tools Infrastructure
Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programming errors such as races, communication err...
Evaluating the Capability of Compilers and Tools to Detect Serial and Parallel Run-time Errors
The ability of system software to detect and issue error messages that help programmers quickly fix serial and parallel run-time errors is an important productivity criterion for developing and maintaining application programs. Over ten thousand run-time error tests and a run-time error detection (RTED) evaluation tool have been developed for the automatic evaluation of run-time error detection...
Collective Error Detection for MPI Collective Operations
An MPI profiling library is a standard mechanism for intercepting MPI calls by applications. Profiling libraries are so named because they are commonly used to gather performance data on MPI programs. Here we present a profiling library whose purpose is to detect user errors in the use of MPI’s collective operations. While some errors can be detected locally (by a single process), other errors ...
Survey of Error and Fault Detection Mechanisms
This report describes diverse error detection mechanisms that can be utilized within a resilient system to protect applications against various types of errors and faults, both hard and soft. These detection mechanisms have different overhead costs in terms of energy, performance, and area, and also differ in their error coverage, complexity, and programmer effort. In order to achieve the highe...
Publication date: 2012